Handling Incomplete Categorical Data for Supervised Learning
Abstract
Classification is an important research topic in knowledge discovery. Most research on classification assumes that a complete dataset is available for training and that the test data contain values for all attributes. In real-world applications, however, data are often incomplete. In this paper, we propose new schemes for learning classification models from incomplete categorical data. Three methods based on rough set theory are developed and discussed for handling incomplete training data. To evaluate the proposed schemes, we ran experiments with several well-known classification models and compared the results with those of previous methods.
Similar resources
Analysis of Dynamic Longitudinal Categorical Data in Incomplete Contingency Tables Using Capture-Recapture Sampling: A case Study of Semi-Concentrated Doctoral Exam
Abstract. In this paper, dynamic longitudinal categorical data and estimation of their parameters in incomplete contingency tables are evaluated. To apply the proposed method, a study has been conducted on the data of the semi-concentrated doctoral exam of the National Organization for Educational Testing (NOET). The results of studies such as the obtained confidence intervals and calculating t...
Missing Data Imputation for Supervised Learning
This paper compares methods for imputing missing categorical data for supervised learning tasks. The ability of researchers to accurately fit a model and yield unbiased estimates may be compromised by missing data, which are prevalent in survey-based social science research. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on ...
A Simple Yet Fast Clustering Approach for Categorical Data
Categorical data has always posed a challenge in data analysis through clustering. With the increasing awareness of Big data analysis, the need for better clustering methods for categorical data and mixed data has arisen. The prevailing clustering algorithms are not suitable for clustering categorical data, mainly because the distance functions used for continuous data are not applicable for...
A Semi-supervised Learning Framework to Cluster Mixed Data Types
We propose a semi-supervised framework to handle diverse data formats or data with mixed-type attributes. Our preliminary results in clustering data with mixed numerical and categorical attributes show that the proposed semi-supervised framework gives better clustering results in the categorical domain. Thus the seeds obtained from clustering the numerical domain give additional knowledge to ...
A semi-supervised regression model for mixed numerical and categorical variables
In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised lear...
Publication date: 2006